41 research outputs found

    Fedora 3.0 and METS: A Partnership for the Organization, Presentation and Preservation of Digital Objects

    Get PDF
    4th International Conference on Open Repositories. This presentation was part of the session: Fedora User Group Presentations. Date: 2009-05-21, 10:30 AM – 12:00 PM.
    Fedora is being implemented in many different kinds of repositories, even within a single institution, e.g., institutional repositories, preservation repositories, and metadata repositories. Within many institutions, METS (Metadata Encoding & Transmission Standard, http://www.loc.gov/standards/mets/) is being used to encode and package content files and metadata for many of the digital objects within these repositories. There has been much speculation about how the two could work together, particularly with the expansion of "content models" in Fedora 3.0. In this proposal, three academic institutions will discuss decisions, plans, and issues arising from the implementation of a "Paged Text" content model that incorporates METS for various purposes related to managing metadata for this type of digital object during its lifecycle. In two 20-minute presentations, each presenter will provide the context for the type and purpose of the repositories being discussed within his/her institution, as well as the related services that pertain to the discussion. In addition, each presenter will explain for what purpose METS is being used within the repositories, e.g., to "stage" content and metadata as a pre-SIP target or organizer for vendors, and/or to package content files and metadata for export to preservation. Areas of discussion will include how METS is, or potentially could be, used in conjunction with the more generalizable mechanisms built into Fedora to manage the structure of a digital object, the disseminators interacting with a digital object (such as page turners for text), and the workflow associated with different "moments" within the lifecycle of the digital object. Presenters will discuss lessons learned as well as future areas of exploration as the Fedora and METS communities continue to work together to optimize the use of each when it makes sense to do so. Questions and discussion from the audience will be encouraged.
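A "Paged Text" object of the kind discussed above is typically organized in METS by a structMap that orders page-level divs. The sketch below builds a minimal structMap-like structure; the METS element and attribute names follow the public METS schema, but the file IDs and labels are hypothetical illustrations, not any institution's actual profile.

```python
# Build a minimal METS-like structMap for a "Paged Text" object.
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)

mets = ET.Element(f"{{{METS_NS}}}mets")
struct_map = ET.SubElement(mets, f"{{{METS_NS}}}structMap", TYPE="physical")
book = ET.SubElement(struct_map, f"{{{METS_NS}}}div",
                     TYPE="book", LABEL="Sample volume")

for n in (1, 2, 3):
    # Each page div points at a content file via an fptr element.
    page = ET.SubElement(book, f"{{{METS_NS}}}div", TYPE="page", ORDER=str(n))
    ET.SubElement(page, f"{{{METS_NS}}}fptr", FILEID=f"FILE{n:04d}")

# The ORDER attributes give a disseminator (e.g. a page turner)
# the sequence it needs to render the object.
pages = struct_map.findall(f".//{{{METS_NS}}}div[@TYPE='page']")
page_order = [p.get("ORDER") for p in pages]
```

This is the kind of structural metadata a page-turner disseminator could consume once the METS package has been ingested into Fedora.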

    EJT editorial standard for the semantic enhancement of specimen data in taxonomy literature

    Get PDF
    This paper describes a set of guidelines for the citation of zoological and botanical specimens in the European Journal of Taxonomy. The guidelines stipulate controlled vocabularies and precise formats for presenting the specimens examined within a taxonomic publication, which allow for the rich data associated with the primary research material to be harvested, distributed and interlinked online via international biodiversity data aggregators. Herein we explain how the EJT editorial standard was defined and how this initiative fits into the journal's project to semantically enhance its publications using the Plazi TaxPub DTD extension. By establishing a standardised format for the citation of taxonomic specimens, the journal intends to widen the distribution of and improve accessibility to the data it publishes. Authors who conform to these guidelines will benefit from higher visibility and new ways of visualising their work. In a wider context, we hope that other taxonomy journals will adopt this approach to their publications, adapting their working methods to enable domain-specific text mining to take place. If specimen data can be efficiently cited, harvested and linked to wider resources, we propose that there is also the potential to develop alternative metrics for assessing impact and productivity within the natural sciences.

    Community next steps for making globally unique identifiers work for biocollections data

    Get PDF
    Biodiversity data is being digitized and made available online at a rapidly increasing rate, but current practices typically do not preserve linkages between these data, which impedes interoperation, provenance tracking, and assembly of larger datasets. For data associated with biocollections, the biodiversity community has long recognized that an essential part of establishing and preserving linkages is to apply globally unique identifiers at the point when data are generated in the field and to persist these identifiers downstream, but this is seldom implemented in practice. There has been neither coalescence towards one single identifier solution (as in some other domains), nor agreement on even a set of recommended best practices and standards to support multiple identifier schemes sharing consistent responses. In order to further progress towards a broader community consensus, a group of biocollections and informatics experts assembled in Stockholm in October 2014 to discuss community next steps to overcome current roadblocks. The workshop participants divided into four groups focusing on: identifier practice in current field biocollections; identifier application for legacy biocollections; identifiers as applied to biodiversity data records as they are published and made available in semantically marked-up publications; and cross-cutting identifier solutions that bridge across these domains. The main outcome was consensus on key issues, including recognition of differences between legacy and new biocollections processes, the need for identifier metadata profiles that can report information on identifier persistence missions, and the unambiguous indication of the type of object associated with the identifier. Current identifier characteristics are also summarized, and an overview of available schemes and practices is provided.
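Two of the workshop needs named above — a metadata profile reporting an identifier's persistence intent, and an unambiguous statement of the object type it denotes — can be sketched as follows. This is a hypothetical illustration using UUIDs; the field names (`objectType`, `persistence`) are invented for the example, not a community standard.

```python
# Sketch: mint a globally unique identifier for a specimen record and
# pair it with a small metadata profile that reports persistence intent
# and the type of object the identifier denotes.
import uuid

def mint_identifier(object_type: str, persistence_statement: str) -> dict:
    """Return an identifier record bundled with its metadata profile."""
    return {
        "identifier": f"urn:uuid:{uuid.uuid4()}",
        "profile": {
            "objectType": object_type,            # e.g. specimen vs. data record
            "persistence": persistence_statement,  # the maintainer's commitment
        },
    }

record = mint_identifier("PreservedSpecimen", "resolvable for at least 10 years")
```

Applying such an identifier at the point of field collection, and carrying it downstream unchanged, is what preserves the linkages the abstract describes.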

    Integrating and visualising primary data from prospective and legacy taxonomic literature.

    Get PDF
    Specimen data in taxonomic literature are among the highest quality primary biodiversity data. Innovative cybertaxonomic journals are using workflows that maintain data structure and disseminate electronic content to aggregators and other users; such structure is lost in traditional taxonomic publishing. Legacy taxonomic literature is a vast repository of knowledge about biodiversity. Currently, access to that resource is cumbersome, especially for non-specialist data consumers. Markup is a mechanism that makes this content more accessible, and is especially suited to machine analysis. Fine-grained XML (Extensible Markup Language) markup was applied to all (37) open-access articles published in the journal Zootaxa containing treatments on spiders (Order: Araneae). The markup approach was optimized to extract primary specimen data from legacy publications. These data were combined with data from articles containing treatments on spiders published in Biodiversity Data Journal, where XML structure is part of the routine publication process. A series of charts was developed to visualize the content of specimen data in XML-tagged taxonomic treatments, either singly or in aggregate. The data can be filtered by several fields (including journal, taxon, institutional collection, collecting country, collector, author, article and treatment) to query particular aspects of the data. We demonstrate here that XML markup using GoldenGATE can address the challenge presented by unstructured legacy data and can extract structured primary biodiversity data that can be aggregated and jointly queried with data from other Darwin Core-compatible sources, and we show how visualization of these data can communicate key information contained in biodiversity literature. We complement recent studies on aspects of biodiversity knowledge using XML structured data to explore 1) the time lag between species discovery and description, and 2) the prevalence of rarity in species descriptions.
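The faceted filtering described above — specimen records queried by journal, taxon, collecting country, and so on — can be sketched in a few lines. The records and field names below are hypothetical stand-ins for Darwin Core-compatible data, not actual output of the study.

```python
# Sketch: filter specimen records (harvested from marked-up treatments)
# by arbitrary field=value criteria, then aggregate one facet.
from collections import Counter

records = [
    {"journal": "Zootaxa", "order": "Araneae", "country": "Brazil"},
    {"journal": "Zootaxa", "order": "Araneae", "country": "Peru"},
    {"journal": "Biodiversity Data Journal", "order": "Araneae",
     "country": "Brazil"},
]

def filter_records(records, **criteria):
    """Keep records matching every supplied field=value criterion."""
    return [r for r in records
            if all(r.get(k) == v for k, v in criteria.items())]

# e.g. distribution of collecting countries across all spider treatments
by_country = Counter(r["country"]
                     for r in filter_records(records, order="Araneae"))
```

Because the filtered records share Darwin Core-style field names, the same query can span legacy markup and born-structured journal content alike.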

    The Plazi Workflow: The PDF prison break for biodiversity data

    No full text
    The Swiss NGO Plazi (http://plazi.org) has developed an automated workflow for liberating data, including images and text, from new taxonomic publications issued in PDF format. This stepwise process extracts article metadata, illustrations and their captions, bibliographic references, scientific names, named geographic entities such as coordinates and country names, collection codes, and finally, taxonomic treatments. These extracted data are enhanced and published in TreatmentBank (http://plazi.org) and deposited in the Biodiversity Literature Repository (https://biolitrepo.org), where a Digital Object Identifier (DataCite DOI) is minted for each article as well as its contained figures and taxon treatments, each linked to the others in their metadata. This input is complemented by the import of Journal Article Tag Suite (JATS)/TaxPub XML-based publications from Pensoft (e.g., ZooKeys, Journal of Hymenoptera Research; https://pensoft.net/browse_journals) that are semantically enhanced during the journal production workflow. Upon import, materials citations are discovered and parsed, and the taxonomic treatments are added to TreatmentBank, where a persistent identifier is minted. From TreatmentBank, data from taxonomic treatments, including occurrence data from cited specimens, are submitted to GBIF (http://gbif.org) or are accessible via API. Treatments and materials citations from more than 26,200 articles have been registered; the articles can be found on GBIF using the Digital Object Identifier in the search field. Plazi, together with Pensoft Publishers, has processed over 26,000 articles containing more than 284,000 taxonomic treatments, 190,000 images, and 50,000 georeferenced materials citations, together comprising an estimated 100 million facts.
    Through the support of the Arcadia Fund (https://www.arcadiafund.org.uk/), Plazi's processing is expanding to cover enough journals to liberate the data of over 50% of the newly described animal species annually. This will complement an existing service provided to the Muséum National d'Histoire Naturelle, Paris, to convert the European Journal of Taxonomy and its other journals (http://sciencepress.mnhn.fr/en/periodiques/adansonia/40/1) to JATS/TaxPub (https://www.ncbi.nlm.nih.gov/books/NBK47081), as well as an increasing portfolio of journals published in JATS/TaxPub by Pensoft Ltd.
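One step of the workflow above — discovering treatments and their materials citations in an imported XML article — can be sketched with the standard library. The element names below loosely mirror the TaxPub vocabulary, but the tiny document, its namespace URI, and the tag choices are invented for illustration; this is not Plazi's actual code.

```python
# Sketch: pull taxonomic treatments and materials citations out of a
# small JATS/TaxPub-style XML fragment (hypothetical example document).
import xml.etree.ElementTree as ET

article = ET.fromstring("""
<article>
  <tp:taxon-treatment xmlns:tp="http://example.org/taxpub">
    <tp:nomenclature>
      <tp:taxon-name>Aname exemplaris</tp:taxon-name>
    </tp:nomenclature>
    <tp:material-citation>Holotype male, Brazil</tp:material-citation>
  </tp:taxon-treatment>
</article>
""")

NS = {"tp": "http://example.org/taxpub"}
treatments = article.findall(".//tp:taxon-treatment", NS)
names = [t.findtext(".//tp:taxon-name", namespaces=NS) for t in treatments]
citations = [c.text for t in treatments
             for c in t.findall("tp:material-citation", NS)]
```

In the real workflow, each discovered treatment receives a persistent identifier in TreatmentBank and its parsed materials citations become occurrence records submitted to GBIF.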

    The Standards behind the Scenes: Explaining data from the Plazi workflow

    No full text
    As part of the CETAF COVID-19 task force, Plazi liberated taxonomic treatments, figures, observation records, biotic interactions, taxonomic names, and collection and specimen codes involving bats and viruses from scholarly publications, with the intention of creating open access data that are findable, accessible, interoperable and reusable (FAIR). The data are accessible via TreatmentBank and the Biodiversity Literature Repository (BLR) and are continually harvested and reused by the Global Biodiversity Information Facility (GBIF) and Global Biotic Interactions (GloBI). These data were processed, enhanced and liberated by the Plazi workflow, which involves a dedicated infrastructure including a desktop application (GoldenGATE Imagine) that converts portable document format (PDF) files to a dedicated open compressed file format, the Image Markup File (IMF), in which the data enhancement takes place. To enhance the data contained in the publications, including the biological interactions, a series of standards and vocabularies are used. With the exception of TaxPub, which is a taxonomy-specific extension of the U.S. National Center for Biotechnology Information's (NCBI) Journal Article Tag Suite (JATS), all of the vocabularies used were previously proposed elsewhere, in line with Plazi's mission to reuse existing standards wherever they are available. The following standards and vocabularies are used: the Metadata Object Description Schema (MODS) to model article metadata in Plazi's XML; Darwin Core for taxonomic ranks and materials-citation-related data; and the Open Biological and Biomedical Ontology (OBO) Relations Ontology for biological interactions between organisms. The latter two are also used in the custom metadata in the Biodiversity Literature Repository at Zenodo.
    In this presentation we will provide an overview of the different types of data, followed by the standards or vocabularies applied to each of them and their parts. The goal is to provide context on how the data liberated by Plazi, which are extensively reused by third-party applications such as GBIF or GloBI, are described. The use of these standards allows fully automated, daily data ingests by GBIF.
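The mapping of data types to standards described above can be summarized as a lookup table. The phrases below paraphrase the abstract's own list; the data-structure layout is just an illustration.

```python
# Sketch: which standard or vocabulary describes each kind of
# liberated data, per the Plazi workflow described above.
standards_by_data = {
    "article metadata":        "MODS",
    "taxonomic ranks":         "Darwin Core",
    "materials citations":     "Darwin Core",
    "biological interactions": "OBO Relations Ontology",
    "article structure":       "JATS + TaxPub extension",
}

def standard_for(kind: str) -> str:
    """Look up the standard used for a given kind of data."""
    return standards_by_data.get(kind, "unknown")
```

A consistent, machine-readable mapping like this is what makes the fully automated daily ingests by downstream aggregators possible.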

    Deliverable 3.3 (D3.3): Updated release and report on publication data-mining software

    No full text
    uploaded by Plazi

    XML schemas and mark-up practices of taxonomic literature

    Get PDF
    We review the three most widely used XML schemas for marking up taxonomic texts: TaxonX, TaxPub and taXMLit. These are described from the viewpoint of their development history, current status, implementation, and use cases. The concept of "taxon treatment" from the viewpoint of taxonomy mark-up into XML is discussed. TaxonX and taXMLit are primarily designed for legacy literature, the former being more lightweight and with a focus on recovery of taxon treatments, the latter providing a much more detailed set of tags to facilitate data extraction and analysis. TaxPub is an extension of the National Library of Medicine Document Type Definition (NLM DTD) for taxonomy focussed on layout and recovery and, as such, is best suited for mark-up of new publications and their archiving in PubMed Central. All three schemas have their advantages and shortcomings and can be used for different purposes.
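The trade-offs reviewed above can be condensed into a small comparison structure; the short phrases paraphrase the review's own conclusions, and the layout is only illustrative.

```python
# Sketch: the three taxonomic markup schemas and their design targets,
# as summarized in the review above.
schemas = {
    "TaxonX":  {"target": "legacy literature",
                "design": "lightweight; focus on treatment recovery"},
    "taXMLit": {"target": "legacy literature",
                "design": "detailed tag set for extraction and analysis"},
    "TaxPub":  {"target": "new publications",
                "design": "NLM DTD extension; layout, recovery, archiving"},
}

# Which schemas are aimed at legacy literature?
legacy = sorted(name for name, s in schemas.items()
                if s["target"] == "legacy literature")
```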

    OpenBiodiv-O: ontology of the OpenBiodiv knowledge management system

    No full text
    Background: The biodiversity domain, and in particular biological taxonomy, is moving in the direction of semantization of its research outputs. The present work introduces OpenBiodiv-O, the ontology that serves as the basis of the OpenBiodiv Knowledge Management System. Our intent is to provide an ontology that fills the gaps between ontologies for biodiversity resources, such as DarwinCore-based ontologies, and semantic publishing ontologies, such as the SPAR Ontologies. We bridge this gap by providing an ontology focusing on biological taxonomy.
    Results: OpenBiodiv-O introduces classes, properties, and axioms in the domains of scholarly biodiversity publishing and biological taxonomy and aligns them with several important domain ontologies (FaBiO, DoCO, DwC, Darwin-SW, NOMEN, ENVO). By doing so, it bridges the ontological gap across scholarly biodiversity publishing and biological taxonomy and allows for the creation of a Linked Open Dataset (LOD) of biodiversity information (a biodiversity knowledge graph) and enables the creation of the OpenBiodiv Knowledge Management System. A key feature of the ontology is that it is an ontology of the scientific process of biological taxonomy and not of any particular state of knowledge. This feature allows it to express a multiplicity of scientific opinions. The resulting OpenBiodiv knowledge system may gain a high level of trust in the scientific community as it does not force a scientific opinion on its users (e.g. practicing taxonomists, library researchers, etc.), but rather provides the tools for experts to encode different views as science progresses.
    Conclusions: OpenBiodiv-O provides a conceptual model of the structure of a biodiversity publication and the development of related taxonomic concepts. It also serves as the basis for the OpenBiodiv Knowledge Management System.
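The key design point above — an ontology of the taxonomic *process* can hold several competing scientific opinions at once — can be sketched with plain (subject, predicate, object) triples. The URIs and predicate names below are invented examples, not actual OpenBiodiv-O identifiers.

```python
# Sketch: two conflicting placements of the same taxonomic name,
# each attributed to its source publication, coexist in one graph.
triples = [
    ("ex:name1",    "ex:relatedTo",   "ex:conceptA"),
    ("ex:name1",    "ex:relatedTo",   "ex:conceptB"),  # competing opinion
    ("ex:conceptA", "ex:accordingTo", "ex:Smith2010"),
    ("ex:conceptB", "ex:accordingTo", "ex:Jones2015"),
]

def opinions_about(name: str) -> list:
    """All concepts a name has been related to, across publications."""
    return sorted(o for s, p, o in triples
                  if s == name and p == "ex:relatedTo")
```

Because each placement is attributed rather than asserted as fact, the knowledge graph records the history of opinions instead of forcing one on its users.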